Sample Size Calculation of RNA-sequencing Experiment-A Simulation-Based Approach of TCGA Data

Author

  • Derek Shyr
Abstract

Power and sample size calculation is an essential component of experimental design in biomedical research. For RNA-sequencing experiments, sample size calculations have been proposed based on mathematical models such as the Poisson and negative binomial distributions; however, RNA-seq data exhibit extra variation (over-dispersion) that has caused past calculation methods to yield under- or over-powered designs. Because of this issue and the field's lack of a simulation-based sample size calculation method for assessing differential expression analysis of RNA-seq data, we developed such a method and applied it to three cancer sites from The Cancer Genome Atlas (TCGA). Our results showed that each cancer site had its own unique dispersion distribution, which influenced the power and sample size calculation.

Citation: Shyr D, Li CI (2014) Sample Size Calculation of RNA-sequencing Experiment-A Simulation-Based Approach of TCGA Data. J Biomet Biostat 5: 198. doi:10.4172/2155-6180.1000198

[...] that the researcher hopes to achieve in his or her experiment [8]. Typically, researchers aim for a power of around 80%. While it would be ideal to have as many subjects as possible to ensure the quality and reproducibility of the results, costs must also be considered.

Material and Methods

Past methods for calculating RNA-seq sample size

Methods for calculating sample size for RNA-seq differential gene expression experiments have been developed and are still being developed. Unlike microarray data sets, which contain continuous measurements, RNA-seq produces count data with a skewed distribution. One of the distributions that has been used to model RNA-seq data is the Poisson distribution. In [10], sample size formulas based on the likelihood ratio test and the score test were derived, and a procedure for calculating sample size while controlling the false discovery rate (FDR) under the Poisson distribution was developed.
While the Poisson distribution may seem to be an appropriate model, its weakness lies in its critical assumption that the mean and variance are equal. This assumption has proven problematic because RNA-seq data are over-dispersed (variance greater than mean); thus, the Poisson model risks underestimating the needed sample size, causing the study to be underpowered [10].

An alternative distribution has also been proposed: the negative binomial. The Poisson distribution is a special case of the negative binomial; the negative binomial also models count data, but it allows the variance to differ from the mean and therefore accommodates over-dispersion. In [9], a comparison between the Poisson and negative binomial distributions on the Transcript Regulation data set, which had significant over-dispersion, showed that the negative binomial required a larger sample size than the Poisson. This difference grew as the fold change increased, which may indicate that the negative binomial model tends to overestimate the required sample size, leading to overpowered experiments.

Other analytical methods for estimating RNA-seq sample size have also been developed. For example, [11] derived an explicit sample size formula by using the score test under the generalized linear model framework.

In this paper, we evaluated the sample size estimates of [10] and [9] by developing a simulation-based approach. Because our method is empirical, we are not limited by the assumptions that the Poisson and negative binomial distributions require. Thus, our method can easily accommodate various RNA-seq data structures based on the data's over-dispersion and fold change.
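To make the idea of simulation-based power estimation concrete, here is a minimal single-gene sketch in Python. It is not the authors' procedure, which draws on TCGA data and is not described in this excerpt. Everything in it is an assumption chosen for illustration: counts are drawn from a negative binomial with variance mean + dispersion × mean², the test statistic is a Welch t statistic on log2(count + 1), the critical value is calibrated by simulating the null case rather than taken from a table, and there is no FDR control across genes. The function names and default parameter values are hypothetical.

```python
import numpy as np


def simulate_nb(rng, mean, dispersion, size):
    """Draw counts with variance = mean + dispersion * mean**2 (assumed model)."""
    n = 1.0 / dispersion          # NumPy's shape parameter
    p = n / (n + mean)            # NumPy's success probability
    return rng.negative_binomial(n, p, size=size)


def welch_t(x, y):
    """Row-wise Welch t statistics for two (n_sim, n_per_group) arrays."""
    vx = x.var(axis=1, ddof=1) / x.shape[1]
    vy = y.var(axis=1, ddof=1) / y.shape[1]
    # The small constant guards against division by zero when a row is constant.
    return (x.mean(axis=1) - y.mean(axis=1)) / np.sqrt(vx + vy + 1e-12)


def estimate_power(n_per_group, mean, dispersion, fold_change,
                   alpha=0.05, n_sim=5000, seed=0):
    """Estimate single-gene power by simulation (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    shape = (n_sim, n_per_group)

    def log_counts(mu):
        return np.log2(simulate_nb(rng, mu, dispersion, shape) + 1.0)

    # Null case: both groups share the same mean. The empirical (1 - alpha)
    # quantile of |t| serves as the critical value.
    null_t = np.abs(welch_t(log_counts(mean), log_counts(mean)))
    critical = np.quantile(null_t, 1.0 - alpha)

    # Alternative case: the second group's mean is scaled by the fold change.
    alt_t = np.abs(welch_t(log_counts(mean), log_counts(mean * fold_change)))
    return float(np.mean(alt_t > critical))


def smallest_sample_size(target_power, candidates, **kwargs):
    """Return the first candidate n per group reaching the target power, else None."""
    for n in candidates:
        if estimate_power(n, **kwargs) >= target_power:
            return n
    return None
```

A call such as `smallest_sample_size(0.8, range(2, 31), mean=50, dispersion=0.2, fold_change=2)` scans group sizes until the estimated power reaches 80%. Raising `dispersion` while holding the other arguments fixed should lower the estimated power at a given sample size, which is the qualitative effect of over-dispersion discussed above.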

Similar resources

Bayesian Sample size Determination for Longitudinal Studies with Continuous Response using Marginal Models

Introduction: Longitudinal study designs are common in many areas of scientific research, especially in the medical, social and economic sciences. The reason is that longitudinal studies allow researchers to measure changes in each individual over time and often have higher statistical power than cross-sectional studies. Choosing an appropriate sample size is a crucial step in a successful study. A st...

PROPER: comprehensive power evaluation for differential expression using RNA-seq

Motivation: RNA-seq has become a routine technique in differential expression (DE) identification. Scientists face a number of experimental design decisions, including the sample size. The power for detecting differential expression is affected by several factors, including the fraction of DE genes, distribution of the magnitude of DE, distribution of gene expression level, sequencing coverage a...

A Graph-Based Clustering Approach to Identify Cell Populations in Single-Cell RNA Sequencing Data

Introduction: The emergence of single-cell RNA-sequencing (scRNA-seq) technology has provided new information about the structure of cells, and provided data with very high resolution of the expression of different genes for each cell at a single time. One of the main uses of scRNA-seq is data clustering based on expressed genes, which sometimes leads to the detection of rare cell populations. ...

Sample size calculation for differential expression analysis of RNA-seq data under Poisson distribution

Sample size determination is an important issue in the experimental design of biomedical research. Because of the complexity of RNA-seq experiments, however, the field currently lacks a sample size method widely applicable to differential expression studies utilising RNA-seq technology. In this report, we propose several methods for sample size calculation for single-gene differential expressio...

Publication date: 2014